(54) Mobile terminal controllable by spoken utterances



(57) A mobile terminal (100) controllable by spoken 
utterances like proper names or keywords is described. 
The mobile terminal (100) comprises an interface (200) 
for receiving voice prompts, a model generator (430) for
generating acoustic models based on the received voice
prompts, and an automatic speech recognizer (110) for 
recognizing the spoken utterances based on the gener- 
ated acoustic models. 
Description

BACKGROUND OF THE INVENTION 

1. Technical Field 

[0001] The invention relates to the field of automatic 
speech recognition and more particularly to a mobile ter- 
minal which is controllable by spoken utterances like 
proper names and command words. The invention fur- 
ther relates to a method for providing acoustic models 
for automatic speech recognition in such a mobile ter- 
minal. 

2. Discussion of the Prior Art 

[0002] Many mobile terminals like mobile telephones 
or personal digital assistants comprise the feature of 
controlling one or more functions thereof by means of 
uttering corresponding keywords. There exist, e. g., mo- 
bile telephones which allow the answering of a call or 
the administration of a telephone book by uttering com- 
mand words. Moreover, many mobile telephones allow 
so-called voice dialling which is initiated by uttering a 
person's name. 

[0003] Controlling a mobile terminal by spoken utter- 
ances necessitates employment of automatic speech 
recognition. During automatic speech recognition, a
recognition result is obtained by comparing previously
generated acoustic models with a spoken utterance
analyzed by an automatic speech recognizer. The acoustic
models can be generated either speaker dependently or
speaker independently.

[0004] Up to now, most mobile terminals employ 
speaker dependent speech recognition and thus speak- 
er dependent acoustic models. The use of speaker de- 
pendent acoustic models necessitates that an individual 
user of the mobile terminal has to train a vocabulary 
based on which automatic speech recognition is per- 
formed. The training is usually done by uttering a single 
keyword one or several times in order to generate the 
corresponding speaker dependent acoustic model. 
[0005] Speech recognition in mobile terminals based 
on speaker dependent acoustic models is not always an 
optimal solution. First of all, the requirement of a sepa- 
rate training for each keyword which is to be used for 
controlling the mobile terminal is time demanding and 
perceived as cumbersome by the user. Moreover, since 
the speaker dependent acoustic models are usually 
stored in the mobile terminal itself, the speaker depend- 
ent acoustic models generated by means of a training 
process are only available for this single mobile termi- 
nal. This means that if the user buys a new mobile ter- 
minal, the time demanding training process has to be 
repeated.
[0006] Because of the above drawbacks of speaker 
dependent speech recognition, mobile terminals sometimes
employ speaker independent speech recognition,
i. e., speech recognition based on speaker independent
acoustic models. There exist several possibilities for 
creating speaker independent acoustic models. If the 
spoken keywords for controlling the mobile terminal
constitute a limited set of command words which are
pre-defined, i. e., not defined by the user of the mobile
terminal, the speaker independent references may be
generated by averaging the spoken utterances of a
large number of different speakers and may be stored
in the mobile terminal prior to its sale.

[0007] On the other hand, if the spoken keywords for 
controlling the mobile terminal can freely be chosen by 
the user, a different method has to be applied. A computer
system for generating speaker independent references
for freely chosen spoken keywords, i. e., keywords
that are not known to the computer system, is described
in EP 0 590 173 A1. The computer system analyzes
each unknown spoken keyword and synthesizes
a corresponding speaker independent acoustic model
by means of a phonetic database. However, the computer
system taught in EP 0 590 173 A1 comprises a
huge memory and sophisticated computational resources
for generating the speaker independent acoustic
models. These resources are generally not available in
small and lightweight mobile terminals.

[0008] As has become apparent from the above, there 
exist several reasons why at least a part of the acoustic 
models to be used for automatic speech recognition are 
not already stored in the mobile terminal upon production.
Thus, it is often necessary to generate speaker dependent
or speaker independent acoustic models after
the mobile terminal has been delivered to the user. However,
up to now this involves sophisticated computational
resources in case speaker independent acoustic
models are used and cumbersome user training in case
speaker dependent acoustic models are employed. 
[0009] There exists, therefore, a need for a mobile ter- 
minal which is controllable by spoken keywords based 
on speaker independent or speaker dependent acoustic
models and which requires minimal effort for generating
a new set or an additional set of acoustic models.
There further exists a need for a method for providing 
acoustic models for automatic speech recognition in 
such a mobile terminal. 


SUMMARY OF THE INVENTION 

[0010] The present invention satisfies this need by 
providing a mobile terminal which is controllable by spoken
utterances like a proper name or a command word
and which comprises an interface for receiving voice
prompts, a model generator for generating acoustic
models based on the received voice prompts, and an
automatic speech recognizer for recognizing the spoken
utterances based on the generated acoustic models.
[0011] According to the invention, a method for pro- 
viding acoustic models for automatic speech recognition 
in a mobile terminal which is controllable by spoken utterances
comprises receiving voice prompts, generating
acoustic models based on the received voice
prompts, and automatically recognizing the spoken ut- 
terances based on the generated acoustic models. 
[0012] Up to now, voice prompts have only been utilized
in one and the same mobile terminal and only for
providing an acoustic feedback, e.g. for a user utterance
recognized by the mobile terminal's automatic
speech recognizer. The invention, however, proposes
to configure a mobile terminal such that it may receive
externally provided voice prompts which are subsequently
used in the mobile terminal for generating the
acoustic models for automatic speech recognition. Consequently,
the acoustic models are generated in a quick
and easy manner based on voice prompts which may
have been generated in advance. Moreover, in case of
already existing voice prompts the mobile terminal according
to the invention necessitates no cumbersome
user training and only a slight increase of the mobile
terminal's hardware resources.

[0013] The voice prompts used for generating the
acoustic models are received by the mobile terminal
via an interface. The interface can be a component
which is configured or programmed to establish a con- 
nection to a voice prompt source which provides the 
voice prompts utilized for generating the acoustic mod- 
els in the mobile terminal. The voice prompt source may 
provide speaker dependent or speaker independent 
voice prompts such that speaker dependent or speaker 
independent acoustic models can be generated. The 
connection established by the interface to the voice 
prompt source can be a wired connection or a wireless 
connection operated e.g. according to a GSM, a UMTS,
a Bluetooth or an IR standard.

[0014] The generated acoustic models may be both 
speaker dependent and speaker independent. Howev- 
er, according to a preferred embodiment of the inven- 
tion, speaker dependent acoustic models are generated 
and speaker dependent voice prompts are used for gen- 
erating the speaker dependent acoustic models. Since 
the quality of speaker dependent voice prompts is often 
higher than the quality of e.g. synthetically generated
speaker independent voice prompts, the recognition ac- 
curacy of automatic speech recognition which is based 
on speaker dependent acoustic models is also higher. 
[0015] The mobile terminal may comprise a voice 
prompt database for storing voice prompts. The voice 
prompts stored in the voice prompt database can be at 
least in part received via the mobile terminal's interface. 
At least some voice prompts stored in the voice prompt 
database can also be stored in the voice prompt data- 
base upon production of the mobile terminal or be gen- 
erated in the mobile terminal by e.g. recording an utter- 
ance of the mobile terminal's user. 
[0016] The mobile terminal may receive the voice 
prompts from a voice prompt source like an external device
(another mobile terminal, a personal digital assistant,
a laptop, a network server, etc.). The mobile terminal's
interface is preferably in communication with the
voice prompt database such that the interface enables
the transfer of the voice prompts received from the external
device to the voice prompt database. The voice prompts 
transferred to the voice prompt database may then be 
stored permanently or temporarily in the voice prompt 
database. 

[0017] The voice prompt database may be irremova- 
bly attached to the mobile terminal or may be arranged 
on a physical carrier like a subscriber identity module 
(SIM) card which is removably connectable to the mo- 
bile terminal. If the voice prompt database is arranged 
on a physical carrier which is removably connectable to 
the mobile terminal, the mobile terminal's interface is 
preferably arranged between the model generator and 
the voice prompt database on the physical carrier. In this 
case the mobile terminal receives the voice prompts 
from the voice prompt database on the physical carrier 
via the mobile terminal's interface. In other words, the 
voice prompt database on the physical carrier consti- 
tutes the voice prompt source from which the mobile ter- 
minal receives the voice prompts utilized for generating 
acoustic models. The received voice prompts may be 
transferred to the mobile terminal's model generator 
which also communicates with the interface. 
[0018] The voice prompts used for generating the
acoustic models can be received by the mobile terminal
in various formats. According to one embodiment,
the voice prompts are received in a format which may 
be readily played back as an acoustic feedback by the 
mobile terminal. Usually, this format of the voice 
prompts can directly be used for generating the acoustic 
models. 

[0019] According to a further embodiment, the voice 
prompts are received by the mobile terminal in an encoded
format. Often, voice prompts are stored in an encoded
format in order to occupy as little memory
as possible. This, however, may necessitate
that the voice prompts be decoded prior to playback
or prior to the generation of acoustic models. Thus,
the mobile terminal may comprise a decoding unit for 
decoding the encoded voice prompts prior to generating 
the acoustic models. The decoding unit is preferably ar- 
ranged between the voice prompt database and the 
model generator. 

BRIEF DESCRIPTION OF THE DRAWINGS 

[0020] Further aspects and advantages of the inven- 
tion will become apparent upon reading the following de- 
tailed description of preferred embodiments of the in- 
vention and upon reference to the figures, in which: 

Fig. 1 shows a schematic diagram of a first embodi- 
ment of a mobile terminal according to the in- 
vention; 

Fig. 2 shows a schematic diagram of a second embodiment
of a mobile terminal according to the
invention.

DESCRIPTION OF PREFERRED EMBODIMENTS 

[0021] In Fig. 1 a schematic diagram of a first exem- 
plary embodiment of a mobile terminal according to the 
invention in the form of a mobile telephone 100 with 
voice dialing functionality is illustrated. 
[0022] The mobile telephone 100 comprises an auto- 
matic speech recognizer 110 which receives a signal 
corresponding to a spoken utterance of a user from a 
microphone 120. The automatic speech recognizer 110 
is in communication with an acoustic model database 
130 in which acoustic models can be stored. During au- 
tomatic speech recognition, acoustic models are com- 
pared by the automatic speech recognizer 110 with the 
spoken utterances received via the microphone 120. 
[0023] The mobile telephone 100 additionally com- 
prises a unit 140 for generating an acoustic feedback 
for a recognized spoken utterance. As becomes appar- 
ent from Fig. 1, the unit 140 for outputting the acoustic 
feedback is in communication with a voice prompt da- 
tabase 150 in which voice prompts may be stored. The 
unit 140 generates an acoustic feedback based on voice 
prompts contained in the voice prompt database 150.
The unit 140 for outputting an acoustic feedback
is further in communication with a loudspeaker 160 
which plays back the acoustic feedback received from 
the unit 140 for outputting the acoustic feedback. 
[0024] The mobile telephone 100 depicted in Fig. 1
also comprises a SIM card 170 on which a transcription
database 180 for storing textual transcriptions is arranged.
The SIM card 170 is removably connected to
the mobile telephone 100 and contains a list with several
textual transcriptions of spoken utterances to be recognized
by the automatic speech recognizer 110. In the
exemplary embodiment depicted in Fig. 1, the transcription
database 180 is configured as a telephone book and
contains a plurality of telephone book entries in the form
of names which are each associated with a specific telephone
number. As can be seen from the drawing, the
first telephone book entry relates to the name "Tom" and
the second telephone book entry relates to the name
"Stefan". The textual transcriptions of the transcription
database 180 are configured as ASCII character strings.
Thus, the textual transcription of the first telephone book
entry consists of the three characters "T", "O" and "M".
As can be seen from Fig. 1, each textual transcription
of the database 180 has a unique index. The textual
transcription "Tom", e. g., has the index "1".
[0025] The transcription database 180 is in communication
with a unit 190 for outputting a visual feedback.
The unit 190 for outputting the visual feedback is configured
to display the textual transcription of a spoken
utterance recognized by the automatic speech recognizer 110.
[0026] The three databases 130, 150, 180 of the mobile
telephone 100 are in communication with an interface
200 of the mobile telephone 100. The interface 200
serves for receiving voice prompts from an external device
300 like a further mobile telephone, a personal digital
assistant, a network server or a laptop by means of
e.g. an infrared, a radio frequency or a wired connection.
[0027] Basically, the interface 200 in the mobile telephone
100 can be separated internally into two blocks
not depicted in Fig. 1. A first block is responsible for
read and write access to the acoustic model database
130, the voice prompt database 150 and the textual
transcription database 180. The second block realizes
the transmission of the data comprised within the
databases 130, 150, 180 to the network server 300 using
a protocol which guarantees a lossless
and fast transmission of the data. Another requirement
on such a protocol is a certain level of security. Furthermore,
the protocol should be designed in such a way that
it is independent of the underlying physical transmission
medium, such as e.g. infrared (IR), Bluetooth,
GSM, etc. Generally, any kind of protocol (proprietary or
standardized) fulfilling the above requirements could be
used. An example of an appropriate protocol is the recently
released SyncML protocol which synchronizes information
stored on two devices even when the connectivity
is not guaranteed. Such a protocol would meet the
necessary requirements to exchange voice prompts,
acoustic models, etc. for speech driven applications in
any mobile terminal.
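The transport independence and lossless transmission required above can be illustrated by a minimal sketch. The frame layout, the record type constants and the function names are assumptions made for illustration only; they are not taken from SyncML or from the described terminal, and the security requirement is not addressed here.

```python
import struct
import zlib

# Hypothetical record types; the description does not define a wire format.
TYPE_VOICE_PROMPT = 1
TYPE_TRANSCRIPTION = 2

_HEADER = struct.Struct(">BHI")  # record type, entry index, payload length


def pack_record(rec_type, index, payload):
    """Frame one database entry as header + payload + CRC32 trailer."""
    header = _HEADER.pack(rec_type, index, len(payload))
    crc = zlib.crc32(header + payload)
    return header + payload + struct.pack(">I", crc)


def unpack_record(frame):
    """Parse a frame; raise ValueError if it was damaged in transit."""
    if len(frame) < _HEADER.size + 4:
        raise ValueError("frame too short")
    rec_type, index, length = _HEADER.unpack_from(frame)
    body_end = _HEADER.size + length
    if len(frame) != body_end + 4:
        raise ValueError("truncated or oversized frame")
    (crc,) = struct.unpack_from(">I", frame, body_end)
    if zlib.crc32(frame[:body_end]) != crc:
        raise ValueError("checksum mismatch")
    return rec_type, index, frame[_HEADER.size:body_end]
```

Because the frame is a plain byte string, the same record could be handed to an infrared, Bluetooth or GSM link without change; the checksum only detects corruption and would have to be combined with retransmission to make the transfer lossless.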

[0028] The mobile telephone 100 depicted in Fig. 1 
further comprises a training unit 400 coupled between 
the automatic speech recognizer 110 and the acoustic 
model database 130, an encoding unit 410 in commu- 
nication with both the microphone 120 and the voice 
prompt database 150, and a decoding unit 420 in communication
with the voice prompt database 150, the unit
140 for generating an acoustic feedback and a model 
generator 430. As can be seen from Fig. 1, the unit 140 
for outputting the acoustic feedback and the model gen- 
erator 430 communicate with the voice prompt database 
150 via the decoding unit 420. Of course, both the unit 
140 for outputting the acoustic feedback and the model 
generator 430 could be provided with a decoding func- 
tionality. In this case the separate decoding unit 420 
could be omitted. Moreover, the training unit 400 and
the model generator 430 may be combined into a single
training and generation unit.

[0029] By means of the training unit 400 and the encoding
unit 410, the mobile telephone 100 depicted in Fig.
1 creates speaker dependent acoustic models and
speaker dependent voice prompts. The creation of
acoustic models and voice prompts as well as further
processes performed by the mobile telephone 100 are
controlled by a central controlling unit not depicted in
Fig. 1.

[0030] The mobile telephone 100 is controlled such
that a user is prompted to utter each keyword, such as a
proper name or a command word, to be used for
voice controlling the mobile telephone 100 one or several
times. The automatic speech recognizer 110 inputs
the training utterance into the training unit 400 which 
works as a voice activity detector suppressing silence 
or noise intervals at the beginning and the end of each 
utterance. The thus filtered utterance is then acoustical- 
ly output to the user for confirmation. If the user confirms 
the filtered utterance, the training unit 400 stores a cor- 
responding speaker dependent acoustic model in the 
acoustic model database 130 in the form of a sequence 
of reference vectors. In the acoustic model database 
130 each generated acoustic model is associated with 
the index of a corresponding textual transcription. 
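The suppression of silence or noise intervals at the beginning and the end of a training utterance may be sketched as follows. The energy criterion, the frame length of 160 samples and the threshold value are assumptions chosen for illustration; the description does not state how the training unit 400 detects voice activity.

```python
def trim_silence(samples, frame_len=160, threshold=500.0):
    """Drop leading and trailing frames whose mean absolute amplitude is low.

    frame_len and threshold are illustrative assumptions, not source values.
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = [
        i for i, frame in enumerate(frames)
        if sum(abs(s) for s in frame) / len(frame) >= threshold
    ]
    if not active:
        return []  # nothing but silence or noise
    first, last = active[0], active[-1]
    # Keep everything between the first and the last active frame.
    return samples[first * frame_len:(last + 1) * frame_len]
```

Pauses inside the utterance are kept, since only the intervals before the first and after the last active frame are removed.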
[0031] For each keyword to be trained, one training 
utterance selected by the user is input from the micro- 
phone 120 to the encoding unit 410 for encoding this 
utterance in accordance with a format that allocates few 
memory resources in the voice prompt database 150. 
The utterance is then stored as an encoded voice 
prompt in the voice prompt database 150. Thus, the 
voice prompt database 150 is filled with speaker de- 
pendent voice prompts. Each voice prompt stored per- 
manently in the voice prompt database 150 is associat- 
ed with the index of a corresponding textual transcrip- 
tion. When a voice prompt is to be played back, an en- 
coded voice prompt loaded from the voice prompt data- 
base 150 is decoded by the decoding unit 420 and 
passed on in a decoded format to the unit 140 for gen- 
erating an acoustic feedback. 

[0032] After the acoustic model database 130 and the
voice prompt database 150 have been filled as explained
above, a telephone call can be set up by means
of a spoken utterance. To set up a call, a user has to
speak an utterance corresponding to a textual transcription
contained in the transcription database 180, e. g.
"Stefan". This spoken utterance is converted by the microphone
120 into a signal which is fed into the automatic
speech recognizer 110.

[0033] As pointed out above, the acoustic models are 
stored in the acoustic model database 130 as a se- 
quence of reference vectors. The automatic speech rec- 
ognizer 110 analyzes the signal from the microphone 
120 corresponding to the spoken utterance "Stefan" in 
order to obtain the reference vectors thereof. This proc- 
ess is called feature extraction. In order to generate a 
recognition result, the automatic speech recognizer 110 
matches the reference vectors of the spoken utterance 
"Stefan" with the reference vectors stored in the data- 
base 130 for each textual transcription. Thus, pattern 
matching takes place. 
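The pattern matching between the vectors of the spoken utterance and the stored sequences of reference vectors may, for example, be sketched with dynamic time warping. The description only states that pattern matching takes place; the choice of dynamic time warping, the Euclidean frame distance and all function names are assumptions.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of vectors."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, acoustic_models):
    """Return the index of the stored model closest to the utterance.

    acoustic_models maps an index to a sequence of reference vectors.
    """
    return min(acoustic_models, key=lambda idx: dtw_distance(utterance, acoustic_models[idx]))
```

The returned value is the index shared with the textual transcription, so that a recognition result such as "2" can be passed on directly to the feedback units.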

[0034] Since the acoustic model database 130 actu- 
ally contains an acoustic model corresponding to the 
spoken utterance "Stefan", a recognition result in the 
form of the index "2", which corresponds to the textual 
transcription "Stefan", is output from the automatic 
speech recognizer 110 to both the unit 140 for outputting
an acoustic feedback and the unit 190 for outputting the 
visual feedback. 

[0035] The unit 140 for outputting an acoustic feedback
loads the voice prompt corresponding to the index
"2" from the voice prompt database 150 and generates
an acoustic feedback corresponding to the word "Stefan".
The acoustic feedback is played back by the loudspeaker
160. Concurrently, the unit 190 for outputting
the visual feedback loads the textual transcription cor- 
responding to the index "2" from the transcription data- 
base 180 and outputs a visual feedback by displaying 
the character sequence "Stefan". 
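The role of the common index in producing both kinds of feedback may be sketched as follows. The dictionary layout, the class name and the telephone numbers are placeholders assumed for illustration; only the names "Tom" and "Stefan" and their indices are taken from the description.

```python
from dataclasses import dataclass


@dataclass
class PhoneBookEntry:
    name: str    # textual transcription, e.g. "Stefan"
    number: str  # telephone number (placeholder values below)


# Both stores are keyed by the same index, as in the description.
transcriptions = {1: PhoneBookEntry("Tom", "0100"), 2: PhoneBookEntry("Stefan", "0200")}
voice_prompts = {1: b"<encoded Tom>", 2: b"<encoded Stefan>"}


def feedback_for(index):
    """Return (voice prompt bytes, display text) for a recognition result."""
    return voice_prompts[index], transcriptions[index].name
```

With the recognition result "2", `feedback_for(2)` yields the encoded prompt to be decoded and played back and the string to be displayed.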

[0036] The user may now confirm the acoustic and
visual feedback and a call may be set up based on the
telephone number which has the index "2". The acoustic
and the visual feedback can be confirmed e. g. by pressing
a confirmation key of the mobile telephone 100 or
by speaking a further utterance relating to a confirmation
command word like "yes" or "call". Acoustic models and
voice prompts for the confirmation command word and
for other command words can be generated and stored
in the same manner as described above in context with
proper names.

[0037] Usually, the lifecycle of a mobile telephone is
rather short. If a user buys a new mobile telephone 100
as depicted in Fig. 1, he usually simply removes the SIM
card 170 with the transcription database 180 from the
old mobile telephone and inserts it into the new mobile
telephone 100. Thus, the textual transcriptions, e.g., a
telephone book, are immediately available in the new
mobile telephone 100. However, the acoustic model database
130 and the voice prompt database 150 remain
empty.

[0038] In the prior art, the user thus has to repeat the 
same time consuming training process he already en- 
countered with the old mobile telephone in order to fill 
the acoustic model database 130 and the voice prompt
database 150 with speaker dependent entries. However,
according to the invention, the time consuming training
process for filling the databases 130, 150 can be
omitted. This is due to the provision of the interface 200 
for receiving voice prompts. 

[0039] Via the interface 200 of the new mobile telephone
100 depicted in Fig. 1, a connection is established
with a corresponding interface of the old mobile
telephone 300. The old mobile telephone 300 may have
the same construction as the new mobile telephone 100
of Fig. 1.

[0040] After the connection between the new mobile
telephone 100 and the old mobile telephone 300 has
been established, the contents of the voice prompt database
of the old mobile telephone 300 may be transferred
via the interface 200 into the voice prompt database
150 of the new mobile telephone 100. The interface
200 also allows transmitting information relating to
the transcription database 180 of the new mobile telephone
100 to the old mobile telephone 300 and
receiving corresponding information from the old mobile
telephone 300. The exchange of information relating to the
textual transcriptions allows the transfer of
voice prompts from the old mobile telephone 300 to the
new mobile telephone 100 to be controlled such that only
voice prompts for which a corresponding textual transcription
exists in the transcription database 180 of the new mobile
telephone are transferred. Moreover, it is ensured that the
voice prompts received from the old mobile telephone
300 and stored in the voice prompt database 150 are
associated within the voice prompt database 150 with the
correct index, i.e., the index of the corresponding textual
transcription. The voice prompts stored in the
voice prompt database 150 can be received from the
old mobile telephone 300 in an encoded format or in a
format which can be readily played back. In the following
it is assumed that the voice prompts are received in an
encoded format.
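The controlled transfer described above, in which only voice prompts with a matching textual transcription are taken over and each is filed under the index used by the new telephone, may be sketched as follows. Matching by exact string equality and the function name are assumptions; the description does not specify how corresponding transcriptions are identified.

```python
def select_prompts_for_transfer(old_transcriptions, old_prompts, new_transcriptions):
    """Map old voice prompts onto the new telephone's indices.

    old_transcriptions / new_transcriptions: dict index -> text
    old_prompts: dict index -> encoded prompt bytes
    Returns dict new_index -> prompt for texts present on both telephones.
    """
    new_index_by_text = {text: idx for idx, text in new_transcriptions.items()}
    transferred = {}
    for old_idx, prompt in old_prompts.items():
        text = old_transcriptions.get(old_idx)
        new_idx = new_index_by_text.get(text)
        if new_idx is not None:  # skip prompts without a matching transcription
            transferred[new_idx] = prompt
    return transferred
```

A prompt whose name does not occur in the new transcription database is simply not transferred, and a prompt stored under index "1" on the old telephone may end up under index "2" on the new one.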

[0041] According to a variant of the invention, the new 
mobile telephone 100 has a voice prompt database 150 
which is at least partly filled with indexed speaker inde- 
pendent voice prompts. The speaker independent voice 
prompts may have been pre-stored for a plurality of 
command words during production of the new mobile 
telephone 100. Using the indices of pre-stored speaker 
independent voice prompts, those pre-stored voice 
prompts of the new mobile telephone 100 for which in 
the old mobile telephone 300 correspondingly indexed 
and user-trained speaker dependent voice prompts ex- 
ist are replaced by the user trained voice prompts. Thus, 
the recognition accuracy of the new mobile telephone 100 is
increased, since more accurate acoustic models can be
generated based on the speaker dependent voice
prompts which replace the speaker independent voice
prompts.
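The replacement of pre-stored speaker independent voice prompts by correspondingly indexed user-trained voice prompts may be sketched as follows. The function name and the dictionary representation of the voice prompt databases are assumptions for illustration.

```python
def replace_prestored_prompts(prestored, user_trained):
    """Return a copy of prestored in which matching indices are overridden.

    Only indices already present in prestored are replaced; user-trained
    prompts without a pre-stored counterpart are ignored in this variant.
    """
    merged = dict(prestored)
    for index, prompt in user_trained.items():
        if index in merged:
            merged[index] = prompt
    return merged
```

Pre-stored prompts for which the old telephone holds no user-trained counterpart remain unchanged, so that every command word keeps a usable voice prompt.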

[0042] According to a further variant of the invention,
the interface 200 of the new mobile telephone 100 can
be configured such that it allows receiving both textual
transcriptions and corresponding voice prompts
from the old mobile telephone 300. Thus, if the transcription
database 180 of the new mobile telephone 100 is
empty or only partly filled, the interface 200 allows
transferring both textual transcriptions and corresponding
voice prompts from the old mobile telephone 300 to the
new mobile telephone 100.

[0043] As has become apparent from the above, by
means of the interface 200 the voice prompt database
150 and, if desired, the transcription database 180 of
the new mobile terminal 100 can be filled with corresponding
data from the old mobile telephone 300. However,
the acoustic model database 130 of the mobile terminal
100 still remains empty. Thus, in a next step, the
acoustic model database 130 has to be filled by means
of the model generator 430 as set out below.
[0044] In order to fill the acoustic model database 130,
the voice prompts are transferred from the voice prompt
database 150 to the model generator 430 via the decoding
unit 420. In the decoding unit 420 the encoded voice
prompts are decoded into a format which can be readily
played back. Then, the voice prompts are transferred in
this decoded format from the decoding unit 420 to the
model generator 430. Based on the decoded voice
prompts received from the decoding unit 420 the model
generator 430 calculates for each voice prompt a sequence
of reference vectors, each sequence of reference
vectors constituting the acoustic model which corresponds
to the specific voice prompt. After the acoustic
models have been generated, they are transferred by
the model generator 430 into the acoustic model database
130. In the acoustic model database each acoustic
model generated by the model generator 430 is associated
with the index of the corresponding voice prompt
and the corresponding textual transcription.
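The calculation of a sequence of reference vectors from a decoded voice prompt may be sketched as follows. The two features used here (log energy and zero-crossing rate per frame) are deliberately simple stand-ins assumed for illustration; the description does not state which features the model generator 430 computes, and the frame length of 160 samples is likewise an assumption.

```python
import math


def frame_features(frame):
    """Two toy features per frame: log energy and zero-crossing rate."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [math.log(energy + 1.0), crossings / len(frame)]


def generate_model(pcm, frame_len=160):
    """Turn decoded samples into a sequence of reference vectors.

    A trailing partial frame is discarded.
    """
    return [
        frame_features(pcm[i:i + frame_len])
        for i in range(0, len(pcm) - frame_len + 1, frame_len)
    ]


def fill_model_database(decoded_prompts, frame_len=160):
    """Build index -> acoustic model from index -> decoded samples."""
    return {idx: generate_model(pcm, frame_len) for idx, pcm in decoded_prompts.items()}
```

Each model is stored under the index of the voice prompt it was derived from, which keeps the acoustic model database aligned with the voice prompt database and the transcription database.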
[0045] Since the acoustic models are generated
based on the voice prompts received from the old mobile
terminal 300, a high degree of compatibility regarding
different generations or different models of mobile telephones
can be ensured. Even if the reference vectors
used for automatic speech recognition by the old mobile
telephone 300 and the new mobile telephone 100 are
not compatible, the inventive concept can be applied
since not the acoustic models themselves but the voice
prompts are exchanged between the new mobile telephone
100 and the old mobile telephone 300. Generating
the acoustic models in the new mobile telephone 100
based on the voice prompts received from the old mobile
telephone 300 thus ensures a high compatibility. The
mobile telephones 100 and 300 depicted in Fig. 1 are
preferably configured such that the voice prompts can
be exchanged even if the mobile telephones 100, 300
are operated without a SIM card 170.

[0046] In Fig. 2, a second embodiment of a mobile telephone
100 according to the invention is illustrated. The
mobile telephone 100 depicted in Fig. 2 has a construction
similar to that of the mobile telephone depicted in Fig.
1. Again, the mobile telephone 100 comprises an interface
200 for receiving speaker independent or speaker
dependent voice prompts.

[0047] In contrast to the mobile telephone 100 depicted
in Fig. 1, however, both the voice prompt database
150 and the transcription database 180 are arranged on
the removable SIM card 170. Moreover, the interface
200 for receiving voice prompts is not configured to establish
a connection to an external device but enables
establishing a connection between the voice prompt database
150 on the removable SIM card 170 and the mobile
terminal 100. The interface 200 may e.g. be an appropriately
configured connector.
[0048] The interface 200 is in communication with the
voice prompt database 150 and also communicates with
both the microphone 120 and the model generator 430
via the encoding unit 410 and the decoding unit 420,
respectively. Although not depicted in Fig. 2, a communication
between the transcription database 180 and
one or more components of the mobile terminal 100 like
the unit 190 for outputting a visual feedback could also
take place via the interface 200.

[0049] If a SIM card 170 with an empty voice prompt 
database 150 is inserted into the mobile terminal 100 
depicted in Fig. 2, the empty voice prompt database 150 
may be filled as described above with respect to the mobile
terminal of Fig. 1. The only difference is that the encoding
unit 410 communicates with the voice prompt database
150 not directly but via the interface 200.
[0050] If a SIM card 170 with an at least partly filled
voice prompt database 150 is inserted into the mobile
terminal 100, and if the acoustic model database 130
does not already contain an acoustic model for each
voice prompt stored in the voice prompt database 150,
the acoustic model database 130 may be filled by
means of the model generator 430 as was explained
above with respect to the mobile terminal depicted in Fig.
1. The mobile terminal 100 of Fig. 2 receives the speaker
independent or speaker dependent voice prompts
based on which acoustic models have to be generated
via the interface 200 from the voice prompt database
150. After decoding in the decoding unit 420, the received
voice prompts are transferred to the model generator
430. The model generator 430 then generates for
each voice prompt a sequence of reference vectors and each
sequence of reference vectors is stored as an acoustic model
in the acoustic model database 130.
[0051] Once the SIM card 170 with the voice prompt
database 150 has been inserted into the mobile terminal
100, generation of the acoustic models is preferably triggered
by switching on the mobile terminal 100. Consequently,
the voice prompts and the recognition references
corresponding to the acoustic models are immediately
available without training or recording.
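The generation of missing acoustic models upon switching on the mobile terminal may be sketched as follows. The function name and the injected callables `decode` and `generate_model`, which stand in for the decoding unit 420 and the model generator 430, are assumptions made for illustration.

```python
def on_power_on(voice_prompts, acoustic_models, decode, generate_model):
    """Generate models for every prompt index that has none yet.

    voice_prompts: dict index -> encoded prompt (e.g. read from the SIM card)
    acoustic_models: dict index -> model, updated in place
    decode, generate_model: callables standing in for units 420 and 430
    Returns the list of indices for which a model was created.
    """
    created = []
    for index, encoded in voice_prompts.items():
        if index not in acoustic_models:
            acoustic_models[index] = generate_model(decode(encoded))
            created.append(index)
    return created
```

Models that already exist in the acoustic model database are left untouched, so that repeated switching on does not repeat the calculation.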
[0052] Of course, the inventive concept described
above by way of example with respect to the mobile terminals
depicted in Figs. 1 and 2 can also be employed in context
with other mobile terminals like personal digital assistants
or laptops.



Claims 

1. A mobile terminal (100) controllable by spoken ut- 
terances, comprising: 

an interface (200) for receiving voice prompts; 

a model generator (430) for generating acous- 
tic models based on the received voice 
prompts; and 

an automatic speech recognizer (110) for rec- 
ognizing the spoken utterances based on the 
generated acoustic models. 

2. The mobile terminal according to claim 1, 
further comprising a voice prompt database (150) 
for storing voice prompts. 

3. The mobile terminal according to claim 2,
wherein the interface (200) is in communication with
the voice prompt database (150) and enables transfer
of voice prompts received from an external device
(300) to the voice prompt database (150).

4. The mobile terminal according to claim 2,
wherein the voice prompt database (150) is arranged
on a physical carrier (170) removably connectable
to the mobile terminal (100).

5. The mobile terminal according to claim 4,
wherein the interface (200) is in communication with
the model generator (430) and enables transfer of
the voice prompts stored in the voice prompt database
(150) on the physical carrier (170) to the model
generator (430).

6. The mobile terminal according to claim 4 or 5,
wherein the physical carrier (170) is a SIM card.

7. The mobile terminal according to one of claims 1 to
6, further comprising a decoding unit (420) arranged
between the voice prompt database (150)
and the model generator (430).

8. A method for providing acoustic models for automatic
speech recognition in a mobile terminal (100)
controllable by spoken utterances, comprising:

receiving voice prompts;

generating acoustic models based on the received
voice prompts; and

automatically recognizing the spoken utterances
based on the generated acoustic models.

9. The method according to claim 8,
wherein the voice prompts are received via an interface
(200) of the mobile terminal (100).

10. The method according to claim 8 or 9,
further comprising receiving the voice prompts from
an external device (300).

11. The method according to one of claims 8 to 10,
further comprising storing the received voice
prompts.

12. The method according to claim 8 or 9,
further comprising receiving the voice prompts from
a voice prompt database (150) arranged on a physical
carrier (170) which is removably connectable to
the mobile terminal (100).

13. The method according to one of claims 8 to 12,
wherein the voice prompts are received in an encoded
format and further comprising decoding the
voice prompts prior to generating the acoustic models.

14. The method according to one of claims 8 to 13,
wherein speaker dependent acoustic models are
generated.